Abstract: As we know, the video game market is constantly growing and in this case it would be said exponentially due to its rapid boom in recent years. That is why the purpose of this research is to provide insights and results of the various exercises done in order for companies in this industry make decisions based on real data. This report will present the entire process of extraction, manipulation, and visualization of data from the videogame database. Likewise, this report is divided into 5 large sections where it can be seen that each one has a different exercise to comply with the aforementioned process.

Introduction: At the moment video games have become one of the most profitable and commercially attractive industries of this decade. In 2017 alone, the global video game market exceeded $ 100 billion, a figure that marked a milestone in the history of this segment, registering a growth of more than 50% in five years. The market is led by video games for mobile devices (42%), followed by video games for consoles (31%) and for computers (27%), which include physical, downloadable and browser video games.

Objective: my goal as a student and as a data analyst, on this occasion, is to clean and analyze a data set which will allow me to create a platform to visualize which video games have been the best-selling in the last few years as well as other interesting exploratory and descriptive analysis. All this work have the goal to identify user preferences and thus, be able to make decisions that open the opportunities for them building greater profitability.

1 Video games 1

In this section you will see the processes to fix the type of data, cleaning blank or null data, and the elimination of duplicates.

1.1 Fixing Data Type

We start by seeing how the ‘videogame’ data frame is composed with the Head function and with the Glimpse function it will tell us how many rows, columns, and the type of data within each column of the data frame video games. As a result, this tells us it has 12 columns and 16,622 rows. By data type, it appears numeric and mostly character.

When looking at the data type of each column in the data frame, several observations were found. First, I found that the “YEAR” column is categorized as a character when it should be an integer by its rounded number. Second, all the columns that have “SALES” in their name (NorthAmerica_Sales, Europe_Sales, RestofWorld_Sales, and Total_Sales) all their data are known as characters when it should be numeric by its decimal numbers. Finally, the columns “WON_AN_AWARD” and “GENRE” that it recognized its data as a character when it should be a factor because of their categories.

## # A tibble: 6 x 12
##   Name  Console Year  Genre Publisher NorthAmerica_Sa… Europe_Sales Japan_Sales
##   <chr> <chr>   <chr> <chr> <chr>     <chr>            <chr>              <dbl>
## 1 2002… PS2     year… Spor… Electron… 0.21             0.17                0.16
## 2 2002… XB      year… Spor… Electron… 0.1400000000000… 0.04                0   
## 3 2002… GC      year… Spor… Electron… 0.04             0.01                0   
## 4 2010… PS3     year… Spor… Electron… 0.3              0.64                0.07
## 5 2010… X360    year… Spor… Electron… 0.32             0.43                0.02
## 6 2010… PSP     year… Spor… Electron… 0.09             0.24                0.03
## # … with 4 more variables: RestofWorld_Sales <chr>, Total_Sales <chr>,
## #   Consumer_rating <dbl>, Won_an_award <chr>
## Rows: 16,622
## Columns: 12
## $ Name               <chr> "2002 FIFA World Cup", "2002 FIFA World Cup", "200…
## $ Console            <chr> "PS2", "XB", "GC", "PS3", "X360", "PSP", "Wii", "P…
## $ Year               <chr> "year 2002", "year 2002", "year 2002", "year 2010"…
## $ Genre              <chr> "Sports", "Sports", "Sports", "Sports", "Sports", …
## $ Publisher          <chr> "Electronic Arts", "Electronic Arts", "Electronic …
## $ NorthAmerica_Sales <chr> "0.21", "0.14000000000000001", "0.04", "0.3", "0.3…
## $ Europe_Sales       <chr> "0.17", "0.04", "0.01", "0.64", "0.43", "0.24", "0…
## $ Japan_Sales        <dbl> 0.16, 0.00, 0.00, 0.07, 0.02, 0.03, 0.01, 0.01, 0.…
## $ RestofWorld_Sales  <chr> "0.06", "0.01", "0", "0.22", "0.09", "0.12", "0.04…
## $ Total_Sales        <chr> "0.59", "0.19", "0.05", "1.23", "0.86", "0.48", "0…
## $ Consumer_rating    <dbl> 79, 40, 28, 82, 52, 74, 52, 61, 62, 75, 47, 72, 74…
## $ Won_an_award       <chr> "did not win", "did not win", "did not win", "won"…

As we can see, by using the function Summary, the numeric data type will calculate the minimum, median, mean, maximum, and quartiles. On the other hand, for the character data type, it will only appear the lengths, class, and mode.

##      Name             Console              Year              Genre          
##  Length:16622       Length:16622       Length:16622       Length:16622      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Publisher         NorthAmerica_Sales Europe_Sales        Japan_Sales        
##  Length:16622       Length:16622       Length:16622       Min.   :-123.00000  
##  Class :character   Class :character   Class :character   1st Qu.:   0.00000  
##  Mode  :character   Mode  :character   Mode  :character   Median :   0.00000  
##                                                           Mean   :   0.03484  
##                                                           3rd Qu.:   0.04000  
##                                                           Max.   :  10.22000  
##  RestofWorld_Sales  Total_Sales        Consumer_rating  Won_an_award      
##  Length:16622       Length:16622       Min.   :  0.00   Length:16622      
##  Class :character   Class :character   1st Qu.: 17.00   Class :character  
##  Mode  :character   Mode  :character   Median : 36.00   Mode  :character  
##                                        Mean   : 38.65                     
##                                        3rd Qu.: 59.00                     
##                                        Max.   :100.00

Fixing the column “Year”: Why is Year not integer? We need to delete the “year” text in its data. By deleting this now will only appear the integer number of the year (Instead of “year 2002” it will only appear only “2002”). For this, we will be using the Mutate function to create a new column of the Year without “year” on it and the function Select to update the data frame with only the column that shows that Year is an integer.

videogame <- videogame %>% mutate(Year_trimmed = str_remove(Year, "year"),Years =as.integer(Year_trimmed)) %>% 
  select(Name, Console, Years, Genre, Publisher, NorthAmerica_Sales, Europe_Sales, Japan_Sales, RestofWorld_Sales, Total_Sales, Consumer_rating, Won_an_award)

Now it appears the minimum, median, mean, maximum, and quartiles for this new column

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1980    2003    2007    2006    2010    2020     271

Fixing the column “NorthAmerica_Sales, Europe_Sales, RestofWorld_Sales & Total_Sales”: In this case, we will do the same process to fix all the “Sales” columns in order to make them numeric instead of character. Here I use the As.numeric function to update the data frame by changing the old incorrect column to the new correct data type column.

videogame$NorthAmerica_Sales <- as.numeric(videogame$NorthAmerica_Sales)
videogame$Europe_Sales <- as.numeric(videogame$Europe_Sales)
videogame$RestofWorld_Sales <- as.numeric(videogame$RestofWorld_Sales)
videogame$Total_Sales <- as.numeric(videogame$Total_Sales)

Now we can observe that it appears the minimum, median, mean, maximum, and quartiles for all the columns that have “Sales” in their names.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.080   0.263   0.240  41.490      31
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0200  0.1547  0.1200 29.0200     638
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0100  0.0493  0.0400 10.5700     397
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.0700  0.1800  0.5504  0.4900 82.7400     381

Fixing the column “Genre”, “Won_an_award”, “Console” & “Publisher”: Here I use the As.factor function to update the data frame by changing the old incorrect column to the new correct data type column.

videogame$Won_an_award <- as.factor(videogame$Won_an_award)
videogame$Genre <- as.factor(videogame$Genre)
videogame$Console <- as.factor(videogame$Console)
videogame$Publisher <- as.factor(videogame$Publisher)

Now we see that this arrangement divided these columns into categories and counted how many are within each category.

summary(videogame$Won_an_award)
summary(videogame$Genre)
summary(videogame$Console)
summary(videogame$Publisher)

Let’s see a recap

Making a recap of what we have already seen and fixed here we use this new function (not seen in class) where it makes us a summary of all the data frame of videogames. This function is call Skim from the skimr library. This function tell us the number of columns, rows, the amount of data types we have, and a summary of each of the columns by the data type.

What this show?: 1. For character type columns, it shows us if there are missing values, min/max, unique data, and blank spaces. 2. For factor type columns, it shows us if there are missing values, unique data, and the amount of data that is per category. 3. For numeric columns, it shows us if there are missing values, complete_rate, the mean, standard deviation, p0, p25, p50, p75, p100, and a small histogram of the data.

# Need the skimr library for this function 
skim(videogame)
Data summary
Name videogame
Number of rows 16622
Number of columns 12
_______________________
Column type frequency:
character 1
factor 4
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 1 132 0 11493 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Console 0 1 FALSE 31 DS: 2165, PS2: 2164, PS3: 1329, Wii: 1328
Genre 0 1 FALSE 12 Act: 3317, Spo: 2354, Mis: 1740, Rol: 1489
Publisher 0 1 FALSE 582 Ele: 1355, Act: 976, Nam: 932, Ubi: 921
Won_an_award 0 1 FALSE 2 did: 16477, won: 145

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Years 271 0.98 2006.41 5.83 1980.00 2003.00 2007.00 2010.00 2020.00 ▁▁▃▇▂
NorthAmerica_Sales 31 1.00 0.26 0.82 0.00 0.00 0.08 0.24 41.49 ▇▁▁▁▁
Europe_Sales 638 0.96 0.15 0.57 0.00 0.00 0.02 0.12 29.02 ▇▁▁▁▁
Japan_Sales 0 1.00 0.03 1.92 -123.00 0.00 0.00 0.04 10.22 ▁▁▁▁▇
RestofWorld_Sales 397 0.98 0.05 0.19 0.00 0.00 0.01 0.04 10.57 ▇▁▁▁▁
Total_Sales 381 0.98 0.55 1.57 0.01 0.07 0.18 0.49 82.74 ▇▁▁▁▁
Consumer_rating 0 1.00 38.65 25.61 0.00 17.00 36.00 59.00 100.00 ▇▇▇▅▂

1.2 Cleaning the Data

Since we have the correct data types in each column, it is time to clean up the data frame. For this, we will be checking all duplicates, out of range values, and replace null values that may exist in the data.

Replacing NA for the Median: Now that we have the correct data type for the Years columns, it means that now it can calculate all dispersion measures and the amount of NA found in this column. As we can see it found 271 NA’s in these columns. To replace all the NA found here we are going to replace it with its median. As you can see in this first chunk these are all the calculation and count of NA in each column.This step is to check the before and after.

summary(videogame$Years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1980    2003    2007    2006    2010    2020     271

Replacing NA for the median in the column -> Years

videogame$Years[which(is.na(videogame$Years))] = 2007

Now here we can see it doesn’t appear anymore NAs

summary(videogame$Years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1980    2003    2007    2006    2010    2020

Replacing NA for the Mean:

summary(videogame$NorthAmerica_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.080   0.263   0.240  41.490      31
summary(videogame$Europe_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0200  0.1547  0.1200 29.0200     638
summary(videogame$RestofWorld_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0100  0.0493  0.0400 10.5700     397
summary(videogame$Total_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.0700  0.1800  0.5504  0.4900 82.7400     381

The process is the same for all columns. First we call the original data frame with the column to replace NA. Here we use the WHICH and IS.NA function to designate that where there is NA in the column X in the data frame Video Games should change it to the average. (The number is the average).

Replacing NA for the mean in the column -> NorthAmerica_r_Sale,Europe_r_Sales, RestWorld_r_Sales & video_games1$Total_r_Sales

videogame$NorthAmerica_Sales[which(is.na(videogame$NorthAmerica_Sales))] = 0.26
videogame$Europe_Sales[which(is.na(videogame$Europe_Sales))] = 0.15
videogame$RestofWorld_Sales[which(is.na(videogame$RestofWorld_Sales))] = 0.05
videogame$Total_Sales[which(is.na(videogame$Total_Sales))] = 0.55

Now here we can see it doesn’t appear anymore NAs

summary(videogame$NorthAmerica_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.080   0.263   0.240  41.490
summary(videogame$Europe_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0300  0.1545  0.1400 29.0200
summary(videogame$RestofWorld_Sales)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.01000  0.04927  0.04000 10.57000
summary(videogame$Total_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0700  0.1900  0.5504  0.5300 82.7400

1.3 Eliminating Duplicates

In this part, we will analyze if there are duplicates in the database. Duplicate happens when all the features’ values within the observations are the same.

Identify FULL duplicates (7 Full duplicate)

sum(duplicated(videogame))
## [1] 7

Deleting duplicate

videogame <- distinct(videogame)

Now you can see there is no FULL duplicates

sum(duplicated(videogame))
## [1] 0

2 Video games 2

In this section we will only work with the columns that have a data factor type. Here the factors will be manipulated, grouped and cleaned by the categories in which each column is separated and an arrangement of format, text and spaces will be made.

2.1 Dealing with Factor Levels

As we see earlier there are 4 columns that have the factors data type, this columns are: “Genre”, “Won_an_award”, “Console” & “Publisher”

Analyzing Won_an_award: Let’s start by analyzing Won_an_award. As this only having two category it’s easy to identify that this column is correct. Similarly, levels of a factor can be checked using the levels function.

videogame %>% count(Won_an_award)
## # A tibble: 2 x 2
##   Won_an_award     n
## * <fct>        <int>
## 1 did not win  16470
## 2 won            145
levels(videogame$Won_an_award)
## [1] "did not win" "won"

Analyzing Publisher: In this case we can see that there are many categories in “Publisher” and that several of them have similar names. Example: 989 Sports and 989 Studios / Also Activision, Activision Blizzard and Activision Value. This leaves us told that this column must be reviewed.

videogame %>% count(Publisher)
## # A tibble: 582 x 2
##    Publisher                        n
##  * <fct>                        <int>
##  1 10TACLE Studios                  3
##  2 1C Company                       3
##  3 20th Century Fox Video Games     5
##  4 2D Boy                           1
##  5 3DO                             36
##  6 49Games                          1
##  7 505 Games                      192
##  8 5pb                             63
##  9 7G//AMES                         4
## 10 989 Sports                       1
## # … with 572 more rows

level function for Publisher will not be shown due to its large and extensive content

Analyzing Genre: As we can see, all the categories have different names and well written so we can deduce that this column is correct.

videogame %>% count(Genre)
## # A tibble: 12 x 2
##    Genre            n
##  * <fct>        <int>
##  1 Action        3317
##  2 Adventure     1291
##  3 Fighting       849
##  4 Misc          1740
##  5 Platform       888
##  6 Puzzle         582
##  7 Racing        1250
##  8 Role-Playing  1489
##  9 Shooter       1312
## 10 Simulation     867
## 11 Sports        2349
## 12 Strategy       681
levels(videogame$Genre)
##  [1] "Action"       "Adventure"    "Fighting"     "Misc"         "Platform"    
##  [6] "Puzzle"       "Racing"       "Role-Playing" "Shooter"      "Simulation"  
## [11] "Sports"       "Strategy"

Analyzing Console: Although this is a column with many categories and some names are similar, I consider it correct because each video game console has a different name. However, I consider that depending on the consoles we can group them later.This leaves us told that this column must be reviewed.

videogame %>% count(Console)
## # A tibble: 31 x 2
##    Console     n
##  * <fct>   <int>
##  1 2600      133
##  2 3DO         3
##  3 3DS       509
##  4 DC         52
##  5 DS       2164
##  6 GB         99
##  7 GBA       824
##  8 GC        556
##  9 GEN        27
## 10 GG          1
## # … with 21 more rows
levels(videogame$Console)
##  [1] "2600" "3DO"  "3DS"  "DC"   "DS"   "GB"   "GBA"  "GC"   "GEN"  "GG"  
## [11] "N64"  "NES"  "NG"   "PC"   "PCFX" "PS"   "PS2"  "PS3"  "PS4"  "PSP" 
## [21] "PSV"  "SAT"  "SCD"  "SNES" "TG16" "Wii"  "WiiU" "WS"   "X360" "XB"  
## [31] "XOne"

2.2 Categorical Data Problems

After analyzing the columns, it was concluded that we should check the column “Console” & “Publisher”

Grouping the column ‘Console’: We can see that the console column has many categories that can be grouped to make smaller categories. An example of this can be the Play Station where there is several categories like: PS, PS2, PS3, PS4 PSP and PSV.

First we see the amount of data by categories int he column ‘Console’ to see that we can join.

videogame %>% count(Console, sort = TRUE)
## # A tibble: 31 x 2
##    Console     n
##    <fct>   <int>
##  1 DS       2164
##  2 PS2      2163
##  3 PS3      1329
##  4 Wii      1328
##  5 X360     1268
##  6 PSP      1213
##  7 PS       1197
##  8 PC        961
##  9 GBA       824
## 10 XB        824
## # … with 21 more rows

Now we collapse or group the categories

videogame <- videogame %>% 
  mutate(Console = fct_collapse(videogame$Console,
                         PlayStation =  c("PS", "PS2", "PS3", "PS4", "PSP", "PSV", "SCD"),
                         NintendoDS = c("3DS", "DS"),
                         Xbox = c("X360", "XB", "XOne"),
                         Computers = c("PC", "PCFX"),
                         GameBoy = c("GB","GBA"),
                         NintendoWii = c("Wii","WiiU"),
                         Old_Consoles = c("2600", "3DO", "DC", "GC", "GEN", "GG", "N64","NES", "NG", "SAT", "SNES", "TG16", "WS"))) 

Now you can observe that the categories went from being 31 to only 7. Likewise, with the HEAD function we can see this new grouping in the data frame of video_games2

videogame %>% count(Console, sort = TRUE)
## # A tibble: 7 x 2
##   Console          n
##   <fct>        <int>
## 1 PlayStation   6659
## 2 NintendoDS    2673
## 3 Xbox          2306
## 4 Old_Consoles  1621
## 5 NintendoWii   1471
## 6 Computers      962
## 7 GameBoy        923

Grouping the column ’Publisher: Like the column Console, we can see that the console column has many categories that can be grouped to make smaller categories.

First, we are going to clean the data by trimming all the blank spaces and correct the string by making the first character of each word is uppercase.With this function now we have two new columns of publisher: one trimmed and the other with all the data having the first character of each word uppercase.

videogame <- videogame %>%  
mutate(Publisher=str_to_title(Publisher), Publisher= str_trim(Publisher))

Because of the mutate we have to fix the data type for the Publisher column to factor again.

videogame$Publisher <- as.factor(videogame$Publisher)

Then we see the amount of data by categories int he column ‘Console’ to see that we can join.

videogame %>% count(Publisher, sort = TRUE)
## # A tibble: 579 x 2
##    Publisher                        n
##    <fct>                        <int>
##  1 Electronic Arts               1351
##  2 Activision                     976
##  3 Namco Bandai Games             932
##  4 Ubisoft                        921
##  5 Konami Digital Entertainment   833
##  6 Thq                            715
##  7 Nintendo                       708
##  8 Sony Computer Entertainment    684
##  9 Sega                           639
## 10 Take-Two Interactive           413
## # … with 569 more rows

Now we collapse or group the some of the Publisher categories

videogame <- videogame %>%  mutate(Publisher= fct_collapse(videogame$Publisher,
                         Studios_989 =  c("989 Sports", "989 Studios"),
                         Activision = c("Activision","Activision Blizzard","Activision Value"),
                         Ascaron_Entertainment = c("Ascaron Entertainment","Ascaron Entertainment GmbH"),
                         Data_Publishing = c("Data Age","Data Design Interactive","Data East"),
                         Electronics = c("Electronic Arts","Electronic Arts Victor")))

Now we can see it group some of the Publisher categories

videogame %>% count(Publisher)
## # A tibble: 573 x 2
##    Publisher                        n
##  * <fct>                        <int>
##  1 10tacle Studios                  3
##  2 1c Company                       3
##  3 20th Century Fox Video Games     5
##  4 2d Boy                           1
##  5 3do                             36
##  6 49games                          1
##  7 505 Games                      192
##  8 5pb                             63
##  9 7g//Ames                         4
## 10 Studios_989                     15
## # … with 563 more rows

If we want to see the top 10 Publisher it would be like this:

videogame %>% count(Publisher, sort = TRUE)  %>% 
  head(10)
## # A tibble: 10 x 2
##    Publisher                        n
##    <fct>                        <int>
##  1 Electronics                   1353
##  2 Activision                    1006
##  3 Namco Bandai Games             932
##  4 Ubisoft                        921
##  5 Konami Digital Entertainment   833
##  6 Thq                            715
##  7 Nintendo                       708
##  8 Sony Computer Entertainment    684
##  9 Sega                           639
## 10 Take-Two Interactive           413

3 Video games 3

In this section we are going to I fix all the out of range values and create graph to visualize where all the NAs are. In addition, we are also identifying the missing pattern. At last, remap the publisher data using string distance and left join.

3.1 Replacing Out of Range Values

As we can saw in the section 1, in all sales columns there are several positive and negative out-of-range data. For this, it has been decided to change these values for a more accurate criteria of how these values out of range could be.

First of all, I want to highlight that a currency conversion will not be done due to two reasons: 1) apparently all these data seem to be in the same unit because the average of all is around 0.5. 2) Being very general places such as Europe and Rest of the world, these places have many countries that handle several different currencies so it cannot be justified by having only one currency. With that said, the criteria used for out-of-range data is that it will be replaced by it’s mean.

North America Sales

As we can saw in the graph, there are some out of range data. We ca filter with the function: assert_all_are_in_closed_range in order to see the specific out of range values.

assert_all_are_in_closed_range(videogame$NorthAmerica_Sales, lower = 0, upper = 10)

Now its time to replace all the out of range value to Na’s in order to visualize how this affect our data set and then replace it with the criteria stated.

videogame <- videogame %>%
  mutate(NorthAmerica_Sales = replace(NorthAmerica_Sales, NorthAmerica_Sales > 10.00000, NA))

Europe Sales

Then we filter the data. As we can see that Europe Sales have 2 out of range values

assert_all_are_in_closed_range(videogame$Europe_Sales, lower = 0, upper = 10)

Finally we replace the Europe sales is that all the data above 10 units by NA

videogame <- videogame %>%
  mutate(Europe_Sales = replace(Europe_Sales, Europe_Sales > 10.00000, NA))

Japan Sales

Then we filter the data. As we can see Japan Sales have 10 out of range value.

assert_all_are_in_closed_range(videogame$Japan_Sales, lower = 0, upper = 10)

Finally we replace the Europe sales is that all the data below 0 units by NA

videogame <- videogame %>%
  mutate(Japan_Sales = replace(Japan_Sales, Japan_Sales < 0, NA))

Rest of the World Sales

By filtering the data we can see that Rest of the World Sales have 3 out of range values

assert_all_are_in_closed_range(videogame$RestofWorld_Sales, lower = 0, upper = 10)

Finally we replace the Rest of the World Sales is that all the data above 10 units by NA

videogame <- videogame %>%
  mutate(RestofWorld_Sales = replace(RestofWorld_Sales, RestofWorld_Sales > 10.00000, NA))

Total_Sales

By filtering the data we can see that Total Sales have 2 out of range values

assert_all_are_in_closed_range(videogame$Total_Sales, lower = 0, upper = 40)

Finally we replace the Total_Sales data that is above 40 units by NA

videogame <- videogame %>%
  mutate(Total_Sales = replace(Total_Sales, Total_Sales > 40.00000, NA))

3.2 Visualizing & Fixing Missing Values

Now we are counting the missing values in the entire data set with the functions SUM and IS.NA. As we can see there is a total of 71 missing values in the data frame

sum(is.na(videogame))
## [1] 71

Also with the VIS_MISS function from the “visdat” library, we can see the nulls that are inside the complete data frame.

vis_miss(videogame) +
  labs(title = "Missing values per column", caption = "Figure 1.6: Missing data  analysis in videgames")

# Another great visualization tool is missing_plot()

Based on the outcome of the missing values graph, this can be concluded that this is NOT a severe case of missing data. This is because the graph shows that the errors occupy less than 0.1% of the total values, so it is considered a non-severe case.

Identifying the missing pattern

Here we use this new function (not seen in class) called missing_pattern and this produces a table and a plot showing the pattern of missingness between variables. the function is missing_pattern from the Finalfit library. Here appear the count of missing values in the column that apply. There are 10 patterns in this data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.

# Need the finalfit library for this function 
videogame %>% missing_pattern(explanatory = )

##       Name Console Years Genre Publisher Consumer_rating Won_an_award
## 16549    1       1     1     1         1               1            1
## 49       1       1     1     1         1               1            1
## 10       1       1     1     1         1               1            1
## 1        1       1     1     1         1               1            1
## 3        1       1     1     1         1               1            1
## 1        1       1     1     1         1               1            1
## 1        1       1     1     1         1               1            1
## 1        1       1     1     1         1               1            1
##          0       0     0     0         0               0            0
##       RestofWorld_Sales Total_Sales Europe_Sales NorthAmerica_Sales Japan_Sales
## 16549                 1           1            1                  1           1
## 49                    1           1            1                  1           0
## 10                    1           1            1                  0           1
## 1                     1           1            0                  1           1
## 3                     1           1            0                  0           1
## 1                     1           0            1                  0           1
## 1                     1           0            0                  1           1
## 1                     0           1            1                  1           1
##                       1           2            5                 14          49
##         
## 16549  0
## 49     1
## 10     1
## 1      1
## 3      2
## 1      2
## 1      2
## 1      1
##       71

As we analyze with the missing_pattern function we can conclude that this missing values are missing at random (MAR). This missing data values do not relate to any other data in the data set and there is some pattern to the actual values of the missing data themselves. When we filtered the data in the previous exercises we could see that all the data that was perceived as out of range was classified as NA. With this in mind, these targets are very random since they do not have any type of connection with other columns, or the conclusion per se.

3.3 Replacing NAs

Now that we know where is all the NAs thank to the previous graph. Now its time to replace all the NAs found in orden to have a clean data set. As we can see in the summary now it appear all the NA in the Sales column.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0800  0.2477  0.2400  9.8100      14
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.030   0.149   0.140   9.270       5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00000  0.00000  0.00000  0.07914  0.04000 10.22000       49
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.00000 0.01000 0.04865 0.04000 8.46000       1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.0700  0.1900  0.5431  0.5300 35.8200       2

Finally, we replace all the NA by the mean of each column

videogame$NorthAmerica_Sales[which(is.na(videogame$NorthAmerica_Sales))] = 0.25
videogame$Europe_Sales[which(is.na(videogame$Europe_Sales))] = 0.15
videogame$Japan_Sales[which(is.na(videogame$Japan_Sales))] = 0.079
videogame$RestofWorld_Sales[which(is.na(videogame$RestofWorld_Sales))] = 0.048
videogame$Total_Sales[which(is.na(videogame$Total_Sales))] = 0.54

As we can see, now there are no more nulls

sum(is.na(videogame))
## [1] 0

And to confirm it let see the graph again

vis_miss(videogame) +
  labs(title = "Clean missing values", caption = "Figure 1.8: Missing data cleaning in videgames")

3.4 Remapping using string distance

Here we are going to remap using the string distance in the column of Publisher .Here I took all 500 publishers and export the list to an excel file, then clean it further and eliminate those that repeat or are wrongly writed.

publisher <- read_xlsx("Publisher.xlsx")

stringdist is a function that can easily solve most text data problems with a single piece of code. In this case, we are making a left_join with a database that has the correct names with the database with the wrong names.With this function, misspelled names will be identified and changed to the correct ones using a Damerau-Levenshtein change method.

Now we can see this run and work, and here you can visualize the head of the column just fix .

stringdist_left_join(videogame, publisher, by = "Publisher", method = "dl") %>% head(10)
## # A tibble: 10 x 13
##    Name  Console Years Genre Publisher.x NorthAmerica_Sa… Europe_Sales
##    <chr> <fct>   <dbl> <fct> <fct>                  <dbl>        <dbl>
##  1 2002… PlaySt…  2002 Spor… Electronics             0.21         0.17
##  2 2002… Xbox     2002 Spor… Electronics             0.14         0.04
##  3 2002… Old_Co…  2002 Spor… Electronics             0.04         0.01
##  4 2010… PlaySt…  2010 Spor… Electronics             0.3          0.64
##  5 2010… Xbox     2010 Spor… Electronics             0.32         0.43
##  6 2010… PlaySt…  2010 Spor… Electronics             0.09         0.24
##  7 2010… Ninten…  2010 Spor… Electronics             0.23         0.15
##  8 2014… PlaySt…  2014 Spor… Electronics             0.15         0.39
##  9 2014… Xbox     2014 Spor… Electronics             0.14         0.28
## 10 Tige… Xbox     2002 Spor… Electronics             0.35         0.08
## # … with 6 more variables: Japan_Sales <dbl>, RestofWorld_Sales <dbl>,
## #   Total_Sales <dbl>, Consumer_rating <dbl>, Won_an_award <fct>,
## #   Publisher.y <chr>

4 Video games 4

In this section we will see multiple graphs of different types for the representation of the data. Also we’ll crossed tables with factors and we analyzed text about the news of the rise of video games where we analyzed the repetitions of words, gave formats to the text and found insights.

4.1 From Counts to Proportions

Tables for 1 and 2 variables Since we will now work with factors it is important that for our first part we see the levels of the factors that we will use for today’s exercise. In this case we will see the levels of “Genre” and “Won an award”

Let’s do a cross tab between won an award and gender

For the exercise we will make a cross table of these two variables mentioned with the function Table. As we can see in the table, here they show which were the genres of video games that won a prize and those that did not with a frecuency format. From the results it is very obvious to conclude that the majority did not win.

table1<- table(videogame$Genre, videogame$Won_an_award)
t(table1)
##              
##               Action Adventure Fighting Misc Platform Puzzle Racing
##   did not win   3284      1289      846 1736      869    580   1231
##   won             33         2        3    4       19      2     19
##              
##               Role-Playing Shooter Simulation Sports Strategy
##   did not win         1475    1295        859   2329      677
##   won                   14      17          8     20        4

Cross tab with percentages

The tables with only the frequencies do not give us as many insights as a table shown in percentage. In this case that is what we will do when using the PROP.TABLE function and multiplying the values by 100.

In the results we can see that the genres of action, sports, and Misc are those with the highest percentage of prize loss. In the other case, we can see that the rate of genres that win prizes is very low.

prop.table(table1)*100
##               
##                did not win         won
##   Action       19.76527234  0.19861571
##   Adventure     7.75804995  0.01203732
##   Fighting      5.09178453  0.01805597
##   Misc         10.44839001  0.02407463
##   Platform      5.23021366  0.11435450
##   Puzzle        3.49082155  0.01203732
##   Racing        7.40896780  0.11435450
##   Role-Playing  8.87752031  0.08426121
##   Shooter       7.79416190  0.10231718
##   Simulation    5.17002708  0.04814926
##   Sports       14.01745411  0.12037316
##   Strategy      4.07463136  0.02407463

With the sum we can confirm that the percentages are with respect to the total so when adding all the decimals this has to give us 100

Now let’s compute the percentages by row:

In this case, here we compute the percentages by rows. In other words, the total percentage per row will be calculated in a base of 100%. From the results we can see that platform games have a 97% probability that they will not win and a 2% probability that they do. For the most part, all genres have a percentage of probability that they will not earn high enough and that they will earn very low.

prop.table(table1,1)*100
##               
##                did not win        won
##   Action        99.0051251  0.9948749
##   Adventure     99.8450813  0.1549187
##   Fighting      99.6466431  0.3533569
##   Misc          99.7701149  0.2298851
##   Platform      97.8603604  2.1396396
##   Puzzle        99.6563574  0.3436426
##   Racing        98.4800000  1.5200000
##   Role-Playing  99.0597717  0.9402283
##   Shooter       98.7042683  1.2957317
##   Simulation    99.0772780  0.9227220
##   Sports        99.1485739  0.8514261
##   Strategy      99.4126285  0.5873715

Being 12 rows, we can expect the sum of 100% of each would give us a sum of 1200

Now let’s compute the percentages by column:

On the contrary, now we are going to compute the percentages per column where each column represents a total of 100%. From the results we see that most of the genres of games that do not win prizes are action, sports and misc. On the contrary, the genres that win the most games are action, platform, racing and sports.

prop.table(table1, 2)*100
##               
##                did not win       won
##   Action         19.939284 22.758621
##   Adventure       7.826351  1.379310
##   Fighting        5.136612  2.068966
##   Misc           10.540376  2.758621
##   Platform        5.276260 13.103448
##   Puzzle          3.521554  1.379310
##   Racing          7.474196 13.103448
##   Role-Playing    8.955677  9.655172
##   Shooter         7.862781 11.724138
##   Simulation      5.215543  5.517241
##   Sports         14.140862 13.793103
##   Strategy        4.110504  2.758621

Since there are two columns, we expect the sum of the rows to be 200

4.2 Visualization of the Data

Information visualization is one of the most important processes when it comes to presenting information, extracting insights and making decisions. In this case, several visualization exercises will be shown through different types of graphs.

Basic chart: In this graph we can see the genres of video games that are most on the market (popularity). From the results we have that the genre of video games most played are action games with more than 3000 units. Second, we have sports games with more than 2000 units and in third we have misc games with more than 1500 units. On the contrary, we can see that the least popular game genre is that of puzzles and strategy.

Bar graph: In this graph we can see the results of the cross tables that I did in the last exercise. Here we can see all the video game genres with the number of video games that won an award in yellow and those that did not in red. As previously concluded, most of these video game genres did not win a award Here you can see the great difference between those who won (that is very minimal) and those who lost

Stacked bars: Here we can see the same results, only that the visualization method is different since this is a cumulative bar graph. Being a cumulative graph, we can see that both colors blue (lost) and green (won) together in the same bar of each category. It’s results are the same as the previous graph.

Histogram & Faceting: In this graph we can see in a unique way each category and its result in a different table. For better understanding, let’s take the example of the action genre. Here we can see by categories such as the rating given by users. The more video games there are per category, the greater the height of the bars, showing that they are the most repeated qualification in each one. The results:the action genred is the most popular and the one with the most highest rating.

Density plot: In this graph we can see a density graph divided into two colors: red (those that did not win a prize) and blue (those that did win). On the X axis is the rating of the consumer towards the different video games. As a result we can see the relationship of the ratings given with the probability that they win or not a prize. We can see that those who won prizes mostly have a rating greater than 75 points while those who lost have less than 65 point.

The middle line tells us what is the average rating that consumers give to video games and this tells us that it is approximately 30-35 points. These leave us told that most of these games did not win a prize and only a select percentage did win a prize.

Lines graph & facet_wrap: In this case, a chart was made to show how sales have been in the different regions of the world over the years. It should be emphasized that these graphs do not have nulls nor out of range values(this data were cleaned and replaced earlier). As we know, each region has a maximum of 10 units and a minimum of 0 units without counting the total sales that its maximum is 40 units(because it is the sum of the other graphs)

As we can see here, sales in the United States are the largest sales by region while in the opposite, sales to the rest of the world reporting the lowest sales. In general, we can see that in general, the graphs growth in sales is appreciated from 2005 and beyond. We can confirm this results with what is analyzed in the different news where it specifies the rise of video games were from 2003-2006.

4.3 Text Mining

Natural Language Processing The first thing we will do is import a text from the news about the rise of the video game market and convert it from text to corpus .

#Let's start by importing the text to analyze
text <- readLines(("Videogames News.txt"), warn=FALSE)
docs <- Corpus(VectorSource(text)) 

4.4 General Text Cleaning

In this part we take care of cleaning the text in different ways:

Convert the text to lower case

docs <- tm_map(docs, content_transformer(tolower))

Remove numbers

docs <- tm_map(docs, removeNumbers)

Removing english common stopwords

docs <- tm_map(docs, removeWords, stopwords("english"))

Remove punctuation

docs <- tm_map(docs, removePunctuation)

Eliminate extra white spaces

docs <- tm_map(docs, stripWhitespace)

4.5 Analyzing words and document frequency

Once the text has been cleaned and everything is formatted the same, we convert the document to matrix which is a table containing the frequency of the words.

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##              word freq
## games       games   12
## million   million   11
## will         will    9
## expected expected    9
## euros       euros    8
## reaching reaching    7
## years       years    7
## annual     annual    6
## grow         grow    6
## next         next    6

Word clouds: With the Word Cloud we can see which are the most repeated words in the text. The bigger the word means that it has more frequency in the text. In this case it will only show the word that have been repeated in the text more than 5 times. As a results: Games, Millions, Expected, Will and Euros are the words that repeat the most.

With the function findAssocs we can see all the words that are associated with the word ‘game’. In other words, here we can see the words that normally go together with the word “game” in the text and their association percentage above 50% (0.5).

findAssocs(dtm, terms = "games", corlimit = 0.5)
## $games
##          expected          although            annual           average 
##              0.82              0.79              0.71              0.71 
##              will           console               one           segment 
##              0.66              0.63              0.63              0.63 
##              hand          continue            social             euros 
##              0.63              0.60              0.60              0.57 
##        accessible             based            casual           channel 
##              0.55              0.55              0.55              0.55 
##      experiencing           general            public            simple 
##              0.55              0.55              0.55              0.55 
##        smartphone            tablet          targeted           usually 
##              0.55              0.55              0.55              0.55 
##       expectation              fall microtransactions            offset 
##              0.55              0.55              0.55              0.55 
##            online          physical              sale          reaching 
##              0.55              0.55              0.55              0.54 
##           million            growth 
##              0.51              0.51

At last, here we can see a graph of the top 10 words with the most repetition in the text in descending order. As a result we can see that the most repeated word is “games” and the least repeated is “grow” within the top 10.

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col ="lightsteelblue2", main ="Top 10 Most frequent words",
        ylab = "Word frequencies")

5 Video games 5

In this last section we will see the calculation of the central tendency measures, variability and shape measure. Also will see some correlation analysis and probability analysis.

5.1 Central Tendendcy, Variability & Shape Measure

For this exercise, we are going to filter the data to see only the data that occurred in 2006 (mean year of the data set) and we will call this new data frame Sales_2006

Sales_2006 <- videogame %>% filter(Years == 2006)

Here we will see a summary of the data already filtered in the new data frame.

summary(Sales_2006)
##      Name                   Console        Years               Genre    
##  Length:1010        Old_Consoles: 40   Min.   :2006   Action      :184  
##  Class :character   NintendoDS  :202   1st Qu.:2006   Sports      :139  
##  Mode  :character   GameBoy     : 39   Median :2006   Misc        :110  
##                     Computers   : 52   Mean   :2006   Role-Playing:110  
##                     PlayStation :476   3rd Qu.:2006   Racing      : 75  
##                     NintendoWii : 46   Max.   :2006   Adventure   : 71  
##                     Xbox        :155                  (Other)     :321  
##                         Publisher   NorthAmerica_Sales  Europe_Sales   
##  Electronics                 :102   Min.   :0.0000     Min.   :0.0000  
##  Konami Digital Entertainment: 66   1st Qu.:0.0000     1st Qu.:0.0000  
##  Namco Bandai Games          : 66   Median :0.0500     Median :0.0100  
##  Ubisoft                     : 61   Mean   :0.1955     Mean   :0.1130  
##  Thq                         : 57   3rd Qu.:0.1900     3rd Qu.:0.0375  
##  Nintendo                    : 55   Max.   :6.4200     Max.   :9.2300  
##  (Other)                     :603                                      
##   Japan_Sales      RestofWorld_Sales  Total_Sales      Consumer_rating 
##  Min.   :0.00000   Min.   :0.00000   Min.   : 0.0100   Min.   :  0.00  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0400   1st Qu.: 11.00  
##  Median :0.00000   Median :0.01000   Median : 0.1300   Median : 27.00  
##  Mean   :0.07979   Mean   :0.05537   Mean   : 0.4495   Mean   : 33.12  
##  3rd Qu.:0.03000   3rd Qu.:0.03000   3rd Qu.: 0.3975   3rd Qu.: 54.00  
##  Max.   :6.50000   Max.   :8.46000   Max.   :30.0100   Max.   :100.00  
##                                                                        
##       Won_an_award 
##  did not win:1002  
##  won        :   8  
##                    
##                    
##                    
##                    
## 

Now, let’s compute the central tendency, variability, and shape statistical measures for the sum of all the sales in 2006 (total sales).

mean(Sales_2006$Total_Sales) #mean
## [1] 0.4494851
var(Sales_2006$Total_Sales) #variance
## [1] 2.529754
sd(Sales_2006$Total_Sales) #st deviation
## [1] 1.59052
median(Sales_2006$Total_Sales) #median
## [1] 0.13
quantile(Sales_2006$Total_Sales, 0.90) #percentile 90
##  90% 
## 0.94
min(Sales_2006$Total_Sales) #minimum
## [1] 0.01
max(Sales_2006$Total_Sales) #maximum
## [1] 30.01
range(Sales_2006$Total_Sales) #range
## [1]  0.01 30.01
skewness(Sales_2006$Total_Sales) #skewness
## [1] 13.92228
IQR(Sales_2006$Total_Sales) # interquartile range
## [1] 0.3575
diff(range(Sales_2006$Total_Sales)) # Range
## [1] 30

The coefficient of variation (CoV) is the ratio of the standard deviation to the mean. The higher the coefficient of variation, the greater the level of dispersion around the mean. Vice versa, the lower the value of the coefficient of variation, the more precise the estimate. In this case we can observe that the CoV is 3.5. This result is pretty low so we can confirm that this data is precise.

CoV <- sd(Sales_2006$Total_Sales)/mean(Sales_2006$Total_Sales)
CoV
## [1] 3.538537

The trimmed mean (similar to an adjusted mean) is a method of averaging that removes a small designated percentage of the largest and smallest values before calculating the mean. It helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional mean. In this case the trimmed mean give us a result of 0.209 (Lower than the normal mean of 0.449)

#Trimmed mean
mean(Sales_2006$Total_Sales, trim = .1)
## [1] 0.2092327

Here we are going to add new column to the data frame in which will appear the Z score of the the data int he column total sales

Sales_2006 <- Sales_2006 %>% 
  mutate(Sales06_zscore = scale(Sales_2006$Total_Sales))

Now lets see if we have outliers. In this case we found that there are 7 of them by filter the 2006 Sales zscore are bigger than 3.

Sales_2006 %>% 
  filter(Sales06_zscore > 3)
## # A tibble: 7 x 13
##   Name  Console Years Genre Publisher NorthAmerica_Sa… Europe_Sales Japan_Sales
##   <chr> <fct>   <dbl> <fct> <fct>                <dbl>        <dbl>       <dbl>
## 1 Cook… Ninten…  2006 Simu… 505 Games             3.13         1.94        0.07
## 2 Fina… PlaySt…  2006 Role… Square E…             1.88         0           2.33
## 3 Gear… Xbox     2006 Shoo… Microsof…             3.54         1.9         0.07
## 4 New … Ninten…  2006 Plat… Nintendo              0.25         9.23        6.5 
## 5 Poke… Ninten…  2006 Role… Nintendo              6.42         4.52        6.04
## 6 The … Ninten…  2006 Acti… Nintendo              3.83         2.19        0.6 
## 7 Wii … Ninten…  2006 Misc  Nintendo              0.25         9.2         2.93
## # … with 5 more variables: RestofWorld_Sales <dbl>, Total_Sales <dbl>,
## #   Consumer_rating <dbl>, Won_an_award <fct>, Sales06_zscore[,1] <dbl>

Now we will see which are the top 10 total sales by name and publisher. In top 3 we have the Nintendo for the games New Super Mario Bros (30), Wii Play (29) and Pokemon Diamond/Pearl (18).

# Top 10 total sales
rank <- Sales_2006 %>% 
           select(Name, Publisher, Total_Sales, Sales06_zscore) %>% 
           arrange(desc(Total_Sales)) %>% head(10)
kable(rank)
Name Publisher Total_Sales Sales06_zscore
New Super Mario Bros. Nintendo 30.01 18.585442
Wii Play Nintendo 29.02 17.963004
Pokemon Diamond/Pokemon Pearl Nintendo 18.36 11.260793
The Legend of Zelda: Twilight Princess Nintendo 7.31 4.313379
Gears of War Microsoft Game Studios 6.11 3.558909
Final Fantasy XII Square Enix 5.95 3.458313
Cooking Mama 505 Games 5.72 3.313706
Guitar Hero II Redoctane 5.12 2.936471
Grand Theft Auto: Vice City Stories Take-Two Interactive 5.08 2.911322
Madden NFL 07 Electronics 4.49 2.540374

Now we are making a density plot in order to visualize the data that we previously discuss. For the total sales in 2006 we can see that the results are mostly right skewed, because almost all its data is concentrated in the right part of the graph (close to 0 on the X axis) and also has a large difference bias.

The skewness is a measure of the symmetry of a distribution. In this case we can see that this plot have a high skewness meaning that the plot is very asymmetrical. with the function skewness we can se the skewness of the plot above. As a result we got a skew of 13.92 units. This is very large difference bias for this chart.

skewness(Sales_2006$Total_Sales)
## [1] 13.92228

For better visualization we can transform the data into its logarithm form. With this we can visualize better the plot and also low the skewness. we got before.

Sales_2006 <- Sales_2006 %>%
  mutate(log_Total_Sales = log(Total_Sales))

Now we can see that the graph can be appreciated much better and now it is more symmetrical. This is because we used the logarithm values of the total sales unit.(The line in the middle show us the mean).

As we can see the comparison between the skewness of the first graph and the second have change a lot.With the logarithm form the skewness low from 13.9 units to 0.4

skewness(Sales_2006$log_Total_Sales)
## [1] 0.4173603

5.2 Graphing logarithm

Total_Sales x Won_an_award: In this case we are going to group the sales by the video games that won and did not win a prize. Here we do a groupby by Won_an_award and we make a summary containing the average, standard deviation, median, quantile of 90% and its interquartile range. Then for better visualization we make a transpose of the results.

From the results we can see that the average, standard deviation, median, quantile of 90% and the interquartile range of total sales in the 8 video games that did win a award is much higher than the 1002 video games that did not win a award.

Won_an_award mean(Total_Sales) sd(Total_Sales) median(Total_Sales) quantile(Total_Sales, (0.9)) IQR(Total_Sales) n()
did not win 0.4080938 1.291042 0.12 0.918 0.34 1002
won 5.6337500 9.763891 1.70 13.823 4.58 8

This Box plot confirms the previous results. Here we can see that the total sales are higher (the double sale)for the video games that won an award rather than the other that not won in exception of two cases that not won an award but got a high sells. (This graph have the total sales in logarithm in order to have for better visualization and short the skewness.)

Total_Sales x Genre: Like the previous exercise, we are going to group the sales by the video games genre with a groupby. We will calculate the average, standard deviation, median, quantile of 90% and its interquartile range for this exercise.

Genre mean(Total_Sales) sd(Total_Sales) median(Total_Sales) quantile(Total_Sales, (0.9)) IQR(Total_Sales) n()
Action 0.3647826 0.8068450 0.11 0.801 0.2900 184
Adventure 0.1767606 0.2894960 0.05 0.500 0.1650 71
Fighting 0.4296364 0.5929773 0.19 1.300 0.4700 55
Misc 0.6467273 2.8412057 0.09 1.100 0.2825 110
Platform 0.9322222 4.0942273 0.15 0.900 0.2800 54
Puzzle 0.3037209 0.5623972 0.09 0.550 0.2500 43
Racing 0.4617333 0.6447215 0.14 1.244 0.5950 75
Role-Playing 0.5297273 1.8890906 0.18 0.551 0.2800 110
Shooter 0.5795652 1.0341243 0.15 1.524 0.4900 69
Simulation 0.3939655 0.8173683 0.16 0.854 0.3975 58
Sports 0.4038129 0.6784973 0.19 0.796 0.4400 139
Strategy 0.1016667 0.1576801 0.04 0.227 0.0675 42

This Box plot confirms the previous results. Here we can see that most of the total sales in 2006 were thanks to the sales of racing, racing, fighting and sports video games. These three were the ones that had the highest sales except for some cases of platform games, misc and role playing games that had games that had great sales too.(This graph have the total sales in logarithm in order to have for better visualization and short the skewness.)

5.3 Correlation Analysis

For this practice lets investigate the association between NorthAmerica_Sales and Europe_Sales.

To know the correlation between data we will use this new function (not learned in class) called cor.test. This function tests for association between paired samples, using one of Pearson’s product moment correlation coefficient, Kendall’s or Spearman’s. This correlation test returns both the correlation coefficient and the significance level (or p-value) of the correlation.

In this case we will see the correlation between NorthAmerica_Sales and Europe_Sales (the two main region of video games sales) with the Pearson’s product-moment correlation.

cor.test(videogame$NorthAmerica_Sales, videogame$Europe_Sales)
## 
##  Pearson's product-moment correlation
## 
## data:  videogame$NorthAmerica_Sales and videogame$Europe_Sales
## t = 119.68, df = 16613, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6721886 0.6885212
## sample estimates:
##       cor 
## 0.6804394

In the result above we have : 1. t is the t-test statistic value (t = 119.68), 2. df is the degrees of freedom (df= 16613), 3. p-value is the significance level of the t-test (p-value = 2.2e-16). 4. conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [0.6721886 0.6885212]); 5. sample estimates is the correlation coefficient (Cor.coeff = 0.6804394).

Due to this result, the p-value of the test is 2.2e-16, meaning both groups have significance differences in the mean values of that variable.

With this plot we can assume that both columns are some how correlate in some parts (mostly in the right skew of the plot)

In this graph we can see the relationship between NorthAmerica_Sales and Europe_Sales categorized by awards they have won. As we can see, both lines show no relationship in the results. For example in the right line in the graph of those who did not win we have that North America has many video games sold that did not win compared to Europe in certain aspects. Similarly in the graph of the video games sold that won an award.

5.4 Probability using Normal Distribution

The first step is to take the average and the standard deviation that we will use to obtain the probabilities.

Now we are computing basic probabilities with the function pnorm that returns the integral from −∞ to q of the function of the normal distribution where q is a Z-score. We input the mean and the standar deviation of Total Sales and input the data we want to be grater than (P(x<2)). This give us a a result of 0.85, meaning that it is near the mean and have a 85% of chance of this values being smaller than 2

pnorm(2, mean = 0.5431116, sd = 1.384904, lower.tail = T)
## [1] 0.8535952

The function rnorm simulates a normally distributed variable. In this case we will follow the normal distribution for the column Total_Sales in its different range. The normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean

n10 <- rnorm(10, mean = 0.5431116, sd = 1.384904)
n100 <- rnorm(100, mean = 0.5431116, sd = 1.384904)
n1000 <-  rnorm(1000, mean = 0.5431116, sd = 1.384904)
n10000 <-  rnorm(10000, mean = 0.5431116, sd = 1.384904)

In the histogram bellow we can see all the normal distribution vectors break it in 10, 100, 1,000 and 10,000. Insight: As the size of the vector increases the more symmetrical the histogram gets because it follow the normal distribution. This is the proof that this follow the central limit theorem that sate “the higher the sample size of number of observations, the more normally distributed that he distribution become.”

6 Conclusion

To conclude, with all the analysis done we can draw many insides that can be of use and value for companies in the video game industry. All this process of data extraction, manipulation, cleaning and visualization was very useful to test our ability to analyze data. However, I consider all this a very enriching process to develop our practical skills of data manipulation in Rstudio and the expansion of knowledge on the subject.

As we have been saying from the beginning, the video game industry is an industry that is constantly growing. We were able to approve this thanks to all the analysis done with the videogames database and with the validation of information with the news found about the growth of this market. I would describe the behavior of video game sales with an almost linear growth. I say this because with the graphics seen, over the years, sales have increased, although there have been years where they have fallen a little, they have risen again and with much more force. In addition, taking into account the pandemic, according to the news, video game sales have doubled, which can be an evaluated as a exponential growth in the sales.

Given the insights and results found, I can say with certainty that the video game market is a market that I would support, because of its growth in recent years. Also according to the news, this market may increase more due to the technological advances that we will see and the new target market that is in view.

6.1 Main Insights:

1. Average sales of video games in the last decades in North America, Europe, Japan and worldwide

## [1] 0.2369697
## [1] 0.1970909
## [1] 0.06153333
## [1] 0.0680404

The the average sales of the last decade on each region are: North America 0.24, Europe 0.20, Japan 0.06 and Rest of the word 0.06.

2. Year in which global video game sales were at its highest: The highest year was between 2006-2010:

3. Game console that has released the majority of video games: The console that has released the majority of video games are PlayStation

4. Genre with the most video games: The genre with the most video games is the action genre.

5. Publisher who has distributed the majority of video games: Here we have the top 10 publishers. On fist place we have Electronics, in second Activision and in third Namco Bandai Games.

Publisher n
Electronics 1353
Activision 1006
Namco Bandai Games 932
Ubisoft 921
Konami Digital Entertainment 833
Thq 715
Nintendo 708
Sony Computer Entertainment 684
Sega 639
Take-Two Interactive 413

6. Video games sales on each region over 40 years: As we can see here, sales in the United States are the largest sales by region while in the opposite, sales to the rest of the world reporting the lowest sales. In general, we can see that in general, the graphs growth in sales is appreciated from 2005 and beyond. We can confirm this results with what is analyzed in the different news where it specifies the rise of video games were from 2003-2006.

7. Correlation between consumer rating and sales Here we can see that the relationship of consumer ratings and total sales are well aligned. Here in the graph we can see that the higher the rating, the higher the sales due to the popularity and reputation of the video game. On the other hand, we can see that the lower the rating, the lower the sale of these video games.

8. Difference in rating and sales by won or not an award In this graph we can see a density graph divided into two colors: red (those that did not win a prize) and blue (those that did win). On the X axis is the rating of the consumer towards the different video games. As a result we can see the relationship of the ratings given with the probability that they win or not a prize. We can see that those who won prizes mostly have a rating greater than 75 points while those who lost have less than 65 point. The middle line tells us what is the average rating that consumers give to video games and this tells us that it is approximately 30-35 points. These leave us told that most of these games did not win a prize and only a select percentage did win a prize

9. Video games Genre with the highest number of video games that have won an award: As we can see in the table, here they show which were the genres of video games that won a prize and those that did not with a frequency format. From the results it is very obvious to conclude that the majority of them didn’t win. Even though the majority of them didn’t win, the genre with the most award is the action genre with 33 and Sports with 20.

did not win won
Action 3284 33
Adventure 1289 2
Fighting 846 3
Misc 1736 4
Platform 869 19
Puzzle 580 2
Racing 1231 19
Role-Playing 1475 14
Shooter 1295 17
Simulation 859 8
Sports 2329 20
Strategy 677 4

10. Most popular console through the years In the case of old consoles, we can see that they have their peak was from the 1990s to 2003 approximately, while the gaming computers (which are the newest) are the most popular recently with a boom period from 2007 to 2011 approximately. The points that we see represent the cases that are outside the range estimated by the box plot as we see with the Nintendo DS. These had their popularity during 2007 and 2012 and with cases in 2020 (the point seen).

6.2 Reference

Daniel Quintana. (2019, June 5). Exploring, cleaning, and analysing data in R [Video file]. Retrieved from https://www.youtube.com/watch?v=Ap1Q2fkqO_I

STHDA. (n.d.). Correlation Test Between Two Variables in R - Easy Guides - Wiki - STHDA. Retrieved April 29, 2021, from http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

Harrison, E. (n.d.). Missing data. Retrieved April 29, 2021, from https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html

Av451, R. (2018, October 30). El sector de videojuegos crecerá en España un 3,6 por ciento en el próximo lustro. Retrieved April 29, 2021, from https://www.audiovisual451.com/el-sector-de-videojuegos-crecera-en-espana-un-36-por-ciento-en-el-proximo-lustro/

Herrero, P. (2020, December 26). La industria del videojuego facturó en 2020, en todo el mundo, más que el cine y los deportes juntos en EEUU. Retrieved April 29, 2021, from https://as.com/meristation/2020/12/26/noticias/1608992024_963325.html#:%7E:text=Las%20cifras%20de%20los%20videojuegos%20en%202020&text=Sin%20embargo%2C%20por%20otro%20lado,de%20hardware%20como%20de%20software.

Ventas de videojuegos crecen a triple dígito en la cuarentena. (2020, May 4). Retrieved April 29, 2021, from https://www.portafolio.co/negocios/ventas-de-videojuegos-crecen-a-triple-digito-en-la-cuarentena-540506